{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bOKA6zjpBvsQ"
   },
   "source": [
    "# Extracting embeddings with ALBERT\n",
    "With Hugging Face transformers, we can use the ALBERT model just like how we used BERT. Let's explore this with a small example. Suppose, we need to get the contextual word embedding of every word in the sentence Paris is a beautiful city. Let's see how to that with ALBERT. \n",
    "\n",
    "Import the necessary modules: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "xNnnQICIByXI",
    "outputId": "c8322c70-13d7-45ff-8578-203ff70cb5de"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: transformers==3.5.1 in /usr/local/lib/python3.6/dist-packages (3.5.1)\n",
      "Requirement already satisfied: dataclasses; python_version < \"3.7\" in /usr/local/lib/python3.6/dist-packages (from transformers==3.5.1) (0.8)\n",
      "Requirement already satisfied: sentencepiece==0.1.91 in /usr/local/lib/python3.6/dist-packages (from transformers==3.5.1) (0.1.91)\n",
      "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.6/dist-packages (from transformers==3.5.1) (4.41.1)\n",
      "Requirement already satisfied: sacremoses in /usr/local/lib/python3.6/dist-packages (from transformers==3.5.1) (0.0.43)\n",
      "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.6/dist-packages (from transformers==3.5.1) (2019.12.20)\n",
      "Requirement already satisfied: filelock in /usr/local/lib/python3.6/dist-packages (from transformers==3.5.1) (3.0.12)\n",
      "Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from transformers==3.5.1) (2.23.0)\n",
      "Requirement already satisfied: packaging in /usr/local/lib/python3.6/dist-packages (from transformers==3.5.1) (20.8)\n",
      "Requirement already satisfied: tokenizers==0.9.3 in /usr/local/lib/python3.6/dist-packages (from transformers==3.5.1) (0.9.3)\n",
      "Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from transformers==3.5.1) (1.19.4)\n",
      "Requirement already satisfied: protobuf in /usr/local/lib/python3.6/dist-packages (from transformers==3.5.1) (3.12.4)\n",
      "Requirement already satisfied: click in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers==3.5.1) (7.1.2)\n",
      "Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers==3.5.1) (1.15.0)\n",
      "Requirement already satisfied: joblib in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers==3.5.1) (1.0.0)\n",
      "Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->transformers==3.5.1) (3.0.4)\n",
      "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->transformers==3.5.1) (2020.12.5)\n",
      "Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->transformers==3.5.1) (2.10)\n",
      "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->transformers==3.5.1) (1.24.3)\n",
      "Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.6/dist-packages (from packaging->transformers==3.5.1) (2.4.7)\n",
      "Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from protobuf->transformers==3.5.1) (51.0.0)\n"
     ]
    }
   ],
   "source": [
    "!pip install transformers==3.5.1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true,
    "id": "NWTJqHTGBvsf"
   },
   "outputs": [],
   "source": [
    "from transformers import AlbertTokenizer, AlbertModel"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5emDFrsoBvsh"
   },
   "source": [
    "\n",
    "Download and load the pre-trained Albert model and tokenizer. In this tutorial, we use the ALBERT-base model: \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true,
    "id": "4rJfIZqcBvsi"
   },
   "outputs": [],
   "source": [
    "model = AlbertModel.from_pretrained('albert-base-v2')\n",
    "tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "oM1qkvNcBvsj"
   },
   "source": [
    "\n",
    "Now, feed the sentence to the tokenizer and get the preprocessed input: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true,
    "id": "Oi6VIJqeBvsk"
   },
   "outputs": [],
   "source": [
    "sentence = \"Paris is a beautiful city\" \n",
    "inputs = tokenizer(sentence, return_tensors=\"pt\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2fnRzxB-Bvsm"
   },
   "source": [
    "\n",
    "Let's print the inputs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "zmtgP87xBvsn",
    "outputId": "a51b8c90-f438-49e2-e20e-37ee35ae19b0"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'input_ids': tensor([[   2, 1162,   25,   21, 1632,  136,    3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}\n"
     ]
    }
   ],
   "source": [
    "print(inputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_p54_NRnBvsn"
   },
   "source": [
    "\n",
    "Now we just feed the inputs to the model and get the result. The model returns the hidden_rep which contains the hidden state representation of all the tokens from the final encoder layer and cls_head which contains the hidden state representation of the [CLS] token from the final encoder layer:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true,
    "id": "NW7947cUBvso"
   },
   "outputs": [],
   "source": [
    "hidden_rep, cls_head = model(**inputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "F0C5UG3VBvsp"
   },
   "source": [
    "\n",
    "\n",
    "We can obtain the contextual word embedding of each word in the sentence just like BERT as:\n",
    "\n",
    "- hidden_rep[0][0] contains the contextual embedding of the token [CLS]\n",
    "- hidden_rep[0][1] contains the contextual embedding of the token 'Paris' \n",
    "- hidden_rep[0][2] contains the contextual embedding of the token 'is' \n",
    "\n",
    "Similarly in this manner, hidden_rep[0][7] contains the contextual embedding of the token 'city'. \n",
    "\n",
    "In this way, we can use the ALBERT model just like how we used the BERT model. We can also fine-tune the ALBERT model similar to how we fine-tuned the BERT model on any downstream task. Now that we learned how ALBERT works, in the next section, let us explore RoBERTa, another interesting variant of BERT."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "name": "4.03. Extracting embeddings with ALBERT.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
